Matching in R

Biostatistics Causal Inference Tutorial R

Observational study;
Causal Inference;
Matching;
R;

Hai Nguyen
July 03, 2021
library(DOS) # Design of Observational Studies
library(tidyverse)
library(kableExtra)
library(optmatch)

Multivariate matching is involved in 3 steps:
(i) creating a distance matrix,
(ii) adding a propensity score caliper to the distance matrix, and
(iii) finding an optimal match.

Matching in R created largely due to the efforts of Ben Hansen (B. B. Hansen 2007), who created an R function, fullmatch, to do optimal matching from Fortan code created by Demetri Bertsekas D. P. Bertsekas (1981).

Research questions

Does financial aid increase college attendance?

Goal

Four matched samples will be constructed: The construction of one of the four matched samples is presented in step-by-step detail.

Context

Data

Data Xb with 2820 rows and 8 columns:
- faminc: family income in units of $10,000
- incmiss: income missing (1 if family income is missing, 0 o.w.)
- black: 1 if black, 0 o.w.
- hispanic: 1 if hispanic, 0 o.w.
- afqtpct: the Armed Forces Qualifications Test
- edm: mother’s education (1 for less than high school, 2 for high school, 3 for some college, 4 for BA degree or more)
- edmissm: mother’s education missing (1 if missing, 0 o.w.)
- female: gender (1 for female, 0 for male)

Imputation: faminc is set to 2 when incmiss = 1 indicating that faminc is missing, and edm is set to 0 when edmissm=1 indicating that edm is missing

AFQT was missing for less than 2% of subjects, and these subjects are not used in the matching. With this restriction, there are 131 high school seniors with deceased fathers and 2689 other high school seniors in the 1979-1981 cohort (131+2689 = 2820)

The treatment zb, where is 1 if the father is deceased, and 0 o.w.
data("dynarski")
dim(dynarski)
[1] 2820   10
kable(head(dynarski, n=20), caption="First 20 people in the 1979-1981 cohort, the treatment zb and the eight covariates Xb") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Table 1: First 20 people in the 1979-1981 cohort, the treatment zb and the eight covariates Xb
id zb faminc incmiss black hisp afqtpct edmissm edm female
1 0 5.3497560 0 0 0 71.901660 0 2 0
2 0 2.4627060 0 0 0 95.158050 0 2 1
4 0 9.4876030 0 0 0 95.785250 0 2 1
5 0 0.4388592 0 0 0 90.617160 0 2 1
6 0 10.4363600 0 0 0 81.761160 0 2 0
7 0 6.2694180 0 0 0 97.917710 0 4 0
10 1 3.2204620 0 0 0 61.916710 0 2 1
11 0 9.4719470 0 0 0 20.446560 0 2 1
12 0 2.7167480 0 0 0 57.175110 0 1 0
14 0 2.0000000 1 0 0 8.278976 0 1 1
15 0 7.1157020 0 0 0 50.928250 0 2 1
16 0 7.7669970 0 0 0 71.149020 0 2 1
17 0 12.5388400 0 0 0 74.912190 0 2 0
18 0 2.0000000 1 0 0 98.394380 0 2 1
21 0 9.4041260 0 0 0 98.720520 0 4 1
23 0 2.0000000 1 0 0 99.724040 0 4 0
24 0 2.0000000 1 0 0 73.281490 0 3 1
29 0 2.0000000 1 1 0 1.781234 0 1 0
31 0 2.0000000 1 0 0 42.900150 0 2 1
34 1 2.6090910 0 0 0 99.623680 0 2 0
#write.csv(dynarski, file="dynarski.csv")
zb<-dynarski$zb
zbf<-factor(zb,levels=c(1,0),labels=c("Father deceased","Father not deceased"))
table(zbf)
zbf
    Father deceased Father not deceased 
                131                2689 
Xb<-dynarski[,3:10]

\(\Rightarrow\) The 131 seniors with deceased fathers will each be matched to 10 controls whose fathers were not deceased.

Covariates with Missing Values

with(dynarski, table(incmiss))
incmiss
   0    1 
2282  538 
incmiss.per<-round(table(dynarski$incmiss)[2]/dim(dynarski)[1]*100,1)
paste0("Percentage of income missing: ", incmiss.per)
[1] "Percentage of income missing: 19.1"
with(dynarski, table(edmissm))
edmissm
   0    1 
2704  116 
edmissm.per<-round(table(dynarski$edmissm)[2]/dim(dynarski)[1]*100,1)
paste0("Percentage of mother's education missing: ", edmissm.per)
[1] "Percentage of mother's education missing: 4.1"

Discussion of the missing covariate values will be later on another post.

Propensity Score

The propensity score is estimated using a logit model.
# Estimate the propensity score
p <- glm(zb ~ Xb$faminc + Xb$incmiss + Xb$black + Xb$hisp + Xb$afqtpct + Xb$edmissm + Xb$edm + Xb$female, family = binomial)$fitted.values

\(\Rightarrow\) The vector p contains the 2820 estimated propensity scores, \(\hat{e}(\mathbf{x}_l)\), \(l\) = 1, 2, …, 2820.

It is reasonable to be able to improve the model: perhaps including interaction terms or transformations or polynomials or whatnot.

Estimated propensity scores for 131 seniors with deceased fathers and 2689 other seniors in the 1979-1981 cohort, before the Social Security Student Benefit Program was eliminated.
boxplot(p~zbf, ylab="Propensity score", main="1979-1981 Cohort")

dynarski$p <- p
ggplot(dynarski, aes(p, fill = zbf)) + 
  geom_histogram(alpha = 0.5, aes(y = ..density..), position = 'identity') +
  ggtitle("Figure of Overlay Histograms of PS in 2 Groups") +
  theme_bw()

\(\Longrightarrow\) the median \(\hat{e}(\mathbf{x})\) in the treated group (0.064) is about equal to the upper quartile among potential controls (0.063). Nonetheless, the distributions in Figure of Overlay Histogram of PS in 2 Groups overlap substantially, so matching appears to be possible.

Distance Matrix

Constructing in 2 steps:
- Step 1: computing the rankbased Mahalanobis distance \(\rightarrow\) 131x2689 distance matrix
#Robust Mahalanobis distance matrix, treated x control
dmat<-smahal(zb,Xb)
dim(dmat)
[1]  131 2689
kable(round(dmat[1:5,1:5],2), caption="First five rows and columns of the 131×2689 distance matrix using the rank-based Mahalanobis distance") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Table 2: First five rows and columns of the 131×2689 distance matrix using the rank-based Mahalanobis distance
1 2 3 4 5
7 3.86 2.61 5.06 6.86 6.72
20 3.47 3.03 7.58 6.23 5.82
108 9.60 19.47 20.02 24.62 13.03
126 6.81 8.05 12.93 10.74 9.88
145 8.70 15.09 17.74 18.86 12.37
#Add a caliper on the propensity score using a penalty function
dmat<-addcaliper(dmat,zb,p,caliper=.2)
dim(dmat)
[1]  131 2689
kable(round(dmat[1:5,1:5],2), caption="First five rows and columns of the 131×2689 distance matrix after adding the propensity score calipers") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Table 3: First five rows and columns of the 131×2689 distance matrix after adding the propensity score calipers
1 2 3 4 5
7 18.60 20.64 42.04 79.91 46.10
20 46.32 3.03 72.66 51.18 73.30
108 82.94 47.40 115.60 39.07 111.01
126 57.81 13.64 86.16 47.54 85.51
145 8.70 54.51 33.32 113.31 30.34

\(\Rightarrow\) Among these 25 entries, only 2 respected the caliper and did not incur a penalty (3.03 and 8.70).

Constructing the Match (B. B. Hansen 2007)

Matching ten controls to each senior with a deceased father: The fullmatch function needs to know the distance matrix, here dmat, matching 10-to-1 means 10x131 will be used, omitting 2689−10x131 = 1379, which is the omit fraction be: 51%.

m<-fullmatch(dmat, data=dynarski, min.controls=10, max.controls=10, omit.fraction=1379/2689)

There is an entry in m for each of the 2820 seniors

length(m)
[1] 2820

The first ten entries in m are

m[1:10]
    1     2     3     4     5     6     7     8     9    10 
1.129 1.100  <NA>  1.54  <NA> 1.121 1.114  <NA> 1.111  1.87 

i.e. the first senior of the 2820 seniors is in matched set #34 and the second senior is in matched set #10. The third senior was one of the 1379 unmatched controls; this is the meaning of the zero in m.01. The fourth senior is in matched set #87, the fifth is unmatched, and so on.

The function matched(·) indicates who is matched. The first ten entries of matched(m) are

matched(m)[1:10]
    1     2     3     4     5     6     7     8     9    10 
 TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE 

the first two seniors were matched but the third was not, and so on.

There are 1441 matched seniors, where 1441 = 131×11, because the 131 seniors with deceased fathers were each matched to ten controls, making 131 matched sets each of size 11.

sum(matched(m))
[1] 1441
sum(matched(m))/11
[1] 131

The first 3 matched sets are in Table below. The first match consists of 11 female high school seniors, neither black nor hispanic, whose mothers had a high school education, with family incomes between $30,000 and $40,000, mostly with test scores between 59% and 77%. In the second matched set, incomes were lower but test scores were higher. And so on.

# Housekeeping
im<-as.integer(m) # matched set from 1 to 131
dynarski<-cbind(dynarski,im) 
dm<-dynarski[matched(m),] # only select matched cases, note that matched(m): T\F
dm<-dm[order(dm$im,1-dm$zb),] # sort data according to matched set then zb decreasing

# Table of matched set example
#which(dm$id==10) # [1] 188
#which(dm$id==396) # [1] 23
#which(dm$id==3051) # [1] 1068

kable(rbind(dm[188:198,-c(11,12)], dm[23:33,-c(11,12)],dm[1068:1078,-c(11,12)]), caption="The 3 of 131 matched sets, each set containing one treated subject and 10 matched controls") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
(#tab:matched set ex)The 3 of 131 matched sets, each set containing one treated subject and 10 matched controls
id zb faminc incmiss black hisp afqtpct edmissm edm female
7 10 1 3.220462 0 0 0 61.91671 0 2 1
239 350 0 3.557851 0 0 0 59.58355 0 2 1
252 365 0 3.599340 0 0 0 61.23934 0 2 1
322 465 0 3.557851 0 0 0 56.37230 0 2 1
361 518 0 3.788779 0 0 0 67.00953 0 2 1
383 550 0 3.788779 0 0 0 63.49724 0 2 1
928 1294 0 3.557851 0 0 0 67.31059 0 2 1
1481 2072 0 3.788779 0 0 0 64.97742 0 2 1
1489 2082 0 3.788779 0 0 0 63.62268 0 2 1
1560 2183 0 3.970631 0 0 0 76.51781 0 2 1
2810 3965 0 3.761650 0 0 0 72.57903 0 2 1
268 396 1 2.371901 0 0 0 88.50979 0 2 1
2 2 0 2.462706 0 0 0 95.15805 0 2 1
101 147 0 2.273267 0 0 0 77.09483 0 2 1
375 537 0 2.595314 0 0 0 95.96086 0 2 1
665 933 0 2.846281 0 0 0 96.11139 0 2 1
698 974 0 1.897521 0 0 0 99.59859 0 3 1
707 987 0 2.134711 0 0 0 81.18414 0 2 1
1394 1947 0 2.050298 0 0 0 91.44506 0 3 1
1518 2124 0 2.298786 0 0 0 72.40341 0 2 1
1877 2618 0 2.211560 0 0 0 68.91621 0 2 1
2816 3975 0 2.371901 0 0 0 90.74260 0 2 1
2182 3051 1 3.409901 0 0 1 62.87004 0 1 0
427 606 0 4.179612 0 0 1 81.73608 0 1 0
472 664 0 4.388592 0 0 1 91.57050 0 1 0
628 884 0 2.846281 0 0 1 48.77070 0 1 0
713 995 0 3.134709 0 0 1 55.11791 0 1 0
724 1008 0 3.320661 0 0 1 51.60562 0 1 0
1006 1399 0 3.439256 0 0 1 90.01505 0 2 0
2087 2908 0 3.795041 0 0 1 80.28098 0 1 0
2328 3262 0 3.788779 0 0 1 57.27546 0 1 0
2424 3400 0 2.925728 0 0 1 80.88309 0 2 0
2576 3624 0 3.320661 0 0 1 74.08430 0 1 1

Checking Covariate Balance

Note that: for covariate \(k\),
- \(\bar{x}_{tk}\) is the mean in the first group in the comparison,
- \(\bar{x}_{ck}\) is the mean in the second group in the comparison, and
- \(\bar{x}_{mck}\) is the mean in the matched comparison group formed from the second group
- \(sd_{bk}\) is the standardized difference in means before matching and
- \(sd_{mk}\) is the standardized difference in means after matching; they have the same denominator but different numerators

#fd <- dynarski[dynarski$zb==1,-c(1,2,12)]
#round(apply(fd, 2, FUN=mean),2)

d.tk<-dynarski %>%
  filter(zb==1) %>%
  mutate(faminc=ifelse(incmiss==1,NA, faminc), 
         edm=ifelse(edmissm==1, NA, edm)) %>%
  select(-id,-zb,-im)
x.tk<-d.tk %>% summarise_all(mean, na.rm=TRUE)
s.tk<-d.tk %>% summarise_all(sd, na.rm=TRUE)

d.ck<-dynarski %>%
  filter(zb==0) %>%
  mutate(faminc=ifelse(incmiss==1,NA, faminc), 
         edm=ifelse(edmissm==1, NA, edm)) %>%
  select(-id,-zb,-im)
x.ck<-d.ck %>% summarise_all(mean, na.rm=TRUE)
s.ck<-d.ck %>% summarise_all(sd, na.rm=TRUE)

sd.bk<-abs(x.tk-x.ck)/sqrt((s.tk^2+s.ck^2)/2)

d.cmk<-dm %>%
  filter(zb==0) %>%
  mutate(faminc=ifelse(incmiss==1,NA, faminc), 
         edm=ifelse(edmissm==1, NA, edm)) %>%
  select(-id,-zb,-im) 
x.cmk<-d.cmk %>% summarise_all(mean, na.rm=TRUE)
sd.cmk<-abs(x.tk-x.cmk)/sqrt((s.tk^2+s.ck^2)/2)

set<-round(rbind(x.tk,x.cmk,x.ck,sd.bk,sd.cmk),2)
row.names(set) <- c("x.tk","x.cmk","x.ck","sd.bk","sd.cmk")

kable(set, caption="Balance on covariates before and after matching") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Table 4: Balance on covariates before and after matching
faminc incmiss black hisp afqtpct edmissm edm female p
x.tk 2.78 0.15 0.35 0.15 49.58 0.08 1.62 0.49 0.07
x.cmk 2.77 0.15 0.34 0.15 49.10 0.04 1.61 0.50 0.06
x.ck 4.58 0.19 0.29 0.15 52.39 0.04 1.91 0.50 0.05
sd.bk 0.71 0.11 0.13 0.02 0.10 0.19 0.33 0.03 0.67
sd.cmk 0.00 0.00 0.02 0.03 0.02 0.19 0.02 0.02 0.09

\(\Longrightarrow\) They were quite different before matching but were much closer after matching. The family income of seniors with deceased fathers was lower, they were more often black, and their mothers had less education. Between 1979-1981, AFQT test scores decline. The imbalances are much smaller after matching.

Outcomes

The outcomes were not provided in the Dynarski’s dataset. But I guessed we can report the proportions and MH-odds ratio between 2 groups after matching.

In short, during 1979-1981, when the Social Security Student Benefit Program provided tuition aid to students of a deceased Social Security beneficiary, seniors with deceased fathers were more likely to attend college and complete one year of college than were matched controls, with an odds ratio of about 1.65, but there is no sign of this in 1982-1983 after the program ended (the later can be checked because of lack the data for analysis).

Standardized differences in covariate means before and after matching in matched comparisons. The boxplot displays standardized differences in means, for the nine covariates and the propensity score.

m<-c(0,1)
d.sd<-rbind(sd.bk,sd.cmk)
d.sd<-cbind(d.sd,m)
d.sd.long<-d.sd %>%
  pivot_longer(-m,names_to="absstdzdiff", values_to="sd")
d.sd.long$m<-factor(d.sd.long$m,levels=c(0,1),labels=c("Unmatched","Matched"))
boxplot(sd~m, data=d.sd.long, xlab="", ylab="Absolute Standardized Difference", main="1979-1981 Cohort: FD vs. FND")

Further reading

Bertsekas. 1991. Linear Network Optimization. Book. Cambridge, MA: MIT Press. http://web.mit.edu/dimitrib/www/net.html.
Bertsekas, Dimitri P. 1981. “A New Algorithm for the Assignment Problem.” Journal Article. Mathematical Programming 21 (1): 152–71. https://doi.org/10.1007/BF01584237.
Dynarski, Susan M. 1999. “Does Aid Matter? Measuring the Effect of Student Aid on College Attendance and Completion.” Journal Article, 1 online resource. http://HZ9PJ6FE4T.search.serialssolutions.com/?V=1.0&L=HZ9PJ6FE4T&S=JCs&C=TC_008322359&T=marc&tab=BOOKS Available only to UIC users.
Hansen, B. B. 2007. “Optmatch: Flexible, Optimal Matching for Observational Studies.” Journal Article. R News 7: 18–24.
Hansen, Ben B., and Stephanie Olsen Klopfer. 2006. “Optimal Full Matching and Related Designs via Network Flows.” Journal Article. Journal of Computational and Graphical Statistics 15 (3): 609–27. www.jstor.org/stable/27594200.
Ming, Kewei, and Paul R. Rosenbaum. 2001. “A Note on Optimal Matching with Variable Controls Using the Assignment Algorithm.” Journal Article. Journal of Computational and Graphical Statistics 10 (3): 455–63. www.jstor.org/stable/1391099.
Rosenbaum, P. R. 1991. “A Characterization of Optimal Designs for Observational Studies.” Journal Article. Journal of the Royal Statistical Society. Series B (Methodological) 53 (3): 597–610. www.jstor.org/stable/2345589.
Rosenbaum, Paul R. 1989. “Optimal Matching for Observational Studies.” Journal Article. Journal of the American Statistical Association 84 (408): 1024–32. https://doi.org/10.2307/2290079.

References

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hai-mn/hai-mn.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Nguyen (2021, July 3). HaiBiostat: Matching in R. Retrieved from https://hai-mn.github.io/posts/2021-07-03-matching in R-causal-inference/

BibTeX citation

@misc{nguyen2021matching,
  author = {Nguyen, Hai},
  title = {HaiBiostat: Matching in R},
  url = {https://hai-mn.github.io/posts/2021-07-03-matching in R-causal-inference/},
  year = {2021}
}